- Why not linear regression?
- Logistic regression
- Multinomial logistic regression
- Evaluating accuracy
- Bayes theorem, LDA, QDA
- k Nearest Neighbours
9/24/2020
You can think of logistic regression as a parametric model for \[\mathbb{P}(Y = y \mid \mathbf{X} = \mathbf{x}),\] just as linear regression was a parametric model for \[\mathbb{E}(Y \mid \mathbf{X} = \mathbf{x}).\]
Binary logistic regression, in particular, directly models the conditional probability that \(Y = 1\) given \(\mathbf{X}\), via \[\mathbb{P}(Y_{i} = 1 \mid \mathbf{X}_{i} = \mathbf{x}_{i}) = F(\beta_{0} + \beta_{1} x_{i1} + \dots + \beta_{p} x_{ip}),\] where \(F(z) = e^{z} / (1 + e^{z})\) is the logistic function.
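A minimal sketch of this mapping from a linear score to a probability, with made-up coefficients \(\beta\) purely for illustration (in practice they are estimated by maximum likelihood):

```python
import numpy as np

def logistic(z):
    # F(z) = e^z / (1 + e^z), maps any real-valued score to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative coefficients (beta_0, beta_1, beta_2) -- not estimated from data
beta = np.array([-1.0, 2.0, 0.5])

x = np.array([0.3, -1.2])                  # one observation with p = 2 predictors
p_hat = logistic(beta[0] + x @ beta[1:])   # P(Y = 1 | X = x) under the model
print(p_hat)
```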
Bayes Theorem simply says that if I give you \(p(y)\) and \(p(x \mid y)\), you can compute \(p(y \mid x)\). Remember the basic rules of probability?
\[\begin{aligned} \mathbb{P}(Y = k \mid X = x) &= \dfrac{\mathbb{P}(X = x \text{ and } Y = k)}{\mathbb{P}(X = x)} \\ &= \dfrac{\mathbb{P}(X = x \mid Y = k) \mathbb{P}(Y = k)}{\sum_{l=1}^K \mathbb{P}(X = x \mid Y = l) \mathbb{P}(Y = l)} \\ &= \dfrac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)} \end{aligned}\]
\[\underbrace{p_k(x)}_\text{posterior} = \mathbb{P}(Y = k \mid X = x) = \dfrac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)} \propto \underbrace{\pi_k}_\text{prior} \cdot \underbrace{f_k(x)}_\text{likelihood}\]
Posterior \(\propto\) prior \(\times\) likelihood
Events:
- \(A\): the person has Covid, with prevalence \(\mathbb{P}(A) = 0.01\)
- \(B\): the test comes back positive, with sensitivity \(\mathbb{P}(B \mid A) = 0.95\) and false-positive rate \(\mathbb{P}(B \mid \sim A) = 0.01\)
Goal: find \(\mathbb{P}(A \mid B)\), i.e. the probability of having Covid given a positive test, using
\[ \begin{aligned} \mathbb{P}(A \mid B) &= \dfrac{\mathbb{P}(B \mid A)\, \mathbb{P}(A)}{\mathbb{P}(B \mid A)\, \mathbb{P}(A) + \mathbb{P}(B \mid \sim A)\, \mathbb{P}(\sim A)} \\ &= \dfrac{0.95 \cdot 0.01}{0.95 \cdot 0.01 + 0.01 \cdot 0.99} \\ &\approx 0.49 \end{aligned}\]
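The same arithmetic as a quick check:

```python
# Bayes' theorem for the Covid example: P(A | B) = P(B | A) P(A) / P(B)
prior = 0.01       # P(A): prevalence of Covid
sens = 0.95        # P(B | A): probability of a positive test given Covid
false_pos = 0.01   # P(B | ~A): probability of a positive test without Covid

evidence = sens * prior + false_pos * (1 - prior)   # P(B), by total probability
posterior = sens * prior / evidence                 # P(A | B)
print(round(posterior, 2))                          # approx. 0.49
```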
The Gaussian density has the form
\[f_k(x) = \dfrac{1}{\sqrt{2 \pi \sigma_k^2}} e^{- \frac{1}{2} \left( \frac{x - \mu_k}{\sigma_k} \right)^2}\]
Here \(\mu_k\) is the mean, and \(\sigma_k^2\) the variance, for the class \(k\).
LDA assumes that all the \(\sigma_k^2\) are the same, i.e. \(\sigma_k^2 = \sigma^2 \ \forall k\). Plugging this into Bayes formula, we get \[\mathbb{P}(Y = k \mid X = x) \propto \pi_k \dfrac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{1}{2} \left( \frac{x - \mu_k}{\sigma} \right)^2}\]
To classify at the value \(X = x\), we need to see which of the curves is largest. Taking logs, and discarding terms that do not depend on \(k\), we see that this is equivalent to assigning \(x\) to the class with the largest discriminant score
\[\delta_k(x) = x \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2 \sigma^2} + \log(\pi_k)\]
Note that \(\delta_k(x)\) is a linear function of \(x\). If there are \(K = 2\) classes and \(\pi_0 = \pi_1 = 0.5\), then the decision boundary is at \(x = \dfrac{\mu_0 + \mu_1}{2}\), as the short derivation below shows.
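To see this, set \(\delta_0(x) = \delta_1(x)\); the \(\log \pi_k\) terms cancel because \(\pi_0 = \pi_1\), leaving (for \(\mu_0 \neq \mu_1\))
\[x \frac{\mu_0}{\sigma^2} - \frac{\mu_0^2}{2 \sigma^2} = x \frac{\mu_1}{\sigma^2} - \frac{\mu_1^2}{2 \sigma^2} \iff x (\mu_0 - \mu_1) = \frac{\mu_0^2 - \mu_1^2}{2} \iff x = \frac{\mu_0 + \mu_1}{2}.\]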
Typically we do not know the parameters for the normal distributions under each class, but we have training data. We can simply estimate the parameters and plug them into the decision rule.
The prior terms can be estimated via the sample proportions: \[\hat{\pi}_k = \dfrac{n_k}{n}\] We can estimate mean and variances with the corresponding sample statistics.
\[\begin{aligned} &\hat{\mu}_k = \dfrac{1}{n_k} \sum_{i : y_i = k} x_i \\ &\hat{\sigma}_{k}^{2} = \dfrac{1}{n_{k} - 1} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)^2 \end{aligned}\]
Since LDA assumes common variance, we need an estimate: \(\hat{\sigma}^2\) is known as pooled variance \[\hat{\sigma}^2 = \dfrac{1}{n - K} \sum_{k = 1}^K (n_k - 1) \hat{\sigma}_k^2 \]
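A minimal sketch of these plug-in estimates and of the resulting classification rule, assuming the training data are stored in numpy arrays `x` (one predictor) and `y` (integer class labels):

```python
import numpy as np

def lda_fit(x, y):
    """Plug-in estimates for one-dimensional LDA: priors, class means, pooled variance."""
    classes = np.unique(y)
    n, K = len(x), len(classes)
    pi_hat = np.array([np.mean(y == k) for k in classes])    # n_k / n
    mu_hat = np.array([x[y == k].mean() for k in classes])   # class-specific sample means
    # pooled variance: (1 / (n - K)) * sum_k (n_k - 1) * sigma_k_hat^2
    sigma2_hat = sum((np.sum(y == k) - 1) * x[y == k].var(ddof=1) for k in classes) / (n - K)
    return classes, pi_hat, mu_hat, sigma2_hat

def lda_predict(x0, classes, pi_hat, mu_hat, sigma2_hat):
    """Assign x0 to the class with the largest discriminant score delta_k(x0)."""
    delta = x0 * mu_hat / sigma2_hat - mu_hat**2 / (2 * sigma2_hat) + np.log(pi_hat)
    return classes[np.argmax(delta)]
```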
Theory says that the Bayes classifier, which classifies an observation to the class for which \(p_k(x)\) is largest, has the lowest possible error rate out of all classifiers.
Caveats (sources of error):
The assumption is that \(X = (X_1,X_2, \dots ,X_p)\) is drawn from a multivariate Gaussian distribution, with a class-specific mean vector and a common covariance matrix.
The multivariate Gaussian density is \[f(x) = \frac{1}{(2 \pi)^{p/2} |\Sigma|^{1/2}} e^{- \frac{1}{2} (x - \mu)^\intercal \Sigma^{-1} (x - \mu)}\]
Discriminant function: \[\delta_k(x) = x^\intercal \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^\intercal \Sigma^{-1} \mu_k + \log \pi_k\] This can be written as: \[\delta_k(x) = a_{k0} + a_{k1} x_1 + \dots + a_{kp} x_p\] which is a linear function of \(x\).
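A sketch of this discriminant, assuming the estimates `mu` (a \(K \times p\) array of class means), `Sigma` (the common \(p \times p\) covariance) and `pi` (the \(K\) class priors) have already been computed from the training data:

```python
import numpy as np

def lda_discriminants(x, mu, Sigma, pi):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k, for each class k."""
    Sigma_inv = np.linalg.inv(Sigma)
    return np.array([
        x @ Sigma_inv @ mu_k - 0.5 * mu_k @ Sigma_inv @ mu_k + np.log(pi_k)
        for mu_k, pi_k in zip(mu, pi)
    ])

# The predicted class is the one with the largest score:
# np.argmax(lda_discriminants(x, mu, Sigma, pi))
```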
Note that there are three lines representing the Bayes decision boundaries: with three classes there are three pairs of classes, and each pair has its own boundary.
With Gaussians but different \(\Sigma_k\) in each class, i.e. \(f_k(x) = \mathcal{N}_p(\mu_k, \Sigma_k)\), where \(\mathcal{N}_p(\mu, \Sigma)\) denotes the multivariate normal distribution, we get quadratic discriminant analysis (QDA); its discriminant is written out below.
With \(f_k(x) = \prod_{j=1}^p f_{jk}(x_j)\) (conditional independence model) in each class we get naive Bayes. In the Gaussian case this means the \(\Sigma_k\) are diagonal.
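For QDA, taking logs of \(\pi_k f_k(x)\) and dropping the terms that do not depend on \(k\) gives
\[\delta_k(x) = -\frac{1}{2} (x - \mu_k)^\intercal \Sigma_k^{-1} (x - \mu_k) - \frac{1}{2} \log |\Sigma_k| + \log \pi_k,\]
which is quadratic in \(x\) because the quadratic term \(x^\intercal \Sigma_k^{-1} x\) now differs across classes.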
For a two-class problem, one can show that for LDA
\[\log \left( \dfrac{p_1(x)}{1 - p_1(x)} \right) = c_0 + c_1 x_1 + \dots + c_p x_p\]
so it has the same form as logistic regression.
The difference is in how the parameters are estimated: logistic regression estimates its coefficients by maximizing the conditional likelihood based on \(\mathbb{P}(Y \mid X)\), while LDA plugs in estimates of \(\pi_k\), \(\mu_k\) and \(\Sigma\) obtained from the joint distribution of \(X\) and \(Y\).
Despite these differences, in practice the results are often very similar.
Logistic regression can also fit quadratic boundaries like QDA, by explicitly including quadratic terms in the model.
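A sketch of this with scikit-learn (assuming it is available), on synthetic data built so that the true boundary is genuinely quadratic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic two-class data, just for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 > 1).astype(int)   # circular (quadratic) boundary

# Adding squared and interaction terms lets logistic regression fit a quadratic boundary
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LogisticRegression())
model.fit(X, y)
print(model.score(X, y))   # training accuracy
```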
Given a positive integer \(k\) and a test observation \(x\), the k-NN classifier first identifies the \(k\) training observations closest to \(x\); it then estimates \(\mathbb{P}(Y = j \mid X = x)\) by the fraction of those neighbours whose response equals \(j\), and classifies \(x\) to the class with the largest estimated probability.
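A minimal sketch of this rule with numpy, assuming training arrays `X_train` (\(n \times p\)) and `y_train` (class labels):

```python
import numpy as np
from collections import Counter

def knn_classify(x0, X_train, y_train, k):
    """Classify x0 by majority vote among its k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # distance from x0 to every training point
    nearest = np.argsort(dists)[:k]                # indices of the k closest points
    votes = Counter(np.asarray(y_train)[nearest])  # estimated P(Y = j | X = x0) is proportional to the counts
    return votes.most_common(1)[0][0]              # class with the largest estimated probability
```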